Abstract

Window profiles of amino acids in protein sequences are taken as a description of the amino acid
environment. The relative entropy, or Kullback-Leibler distance, derived from the profiles is used as a measure
of dissimilarity for comparing amino acids and secondary structure conformations. Distance matrices
of amino acid pairs at different conformations are obtained; they display a non-negligible dependence of
amino acid similarity on conformation. Based on the conformation-specific distances, a clustering analysis
of the amino acids is conducted.

PACS number(s): 87.10.+e,02.50.-r 

1 Introduction 

The similarity of amino acids (aa) is the basis of protein sequence alignment, protein design and protein
structure prediction. Several scoring schemes have been proposed based on amino acid similarity. The mutation
data matrices of Dayhoff [6] and the substitution matrices of Henikoff [1] are standard choices of scores for
sequence alignment and amino acid similarity evaluation. However, these matrices, which focus on the whole
protein database, pay little attention to protein secondary structures (ss). How amino acid similarity
is influenced by different secondary structures is an interesting question. Furthermore, understanding the
differences can help in protein sequence analysis.

Despite efforts in uncovering the information encoded in the primary structure, we still cannot read 
the language describing the final 3D fold of an active biological macromolecule. Compared with the DNA 
sequence, a protein sequence is generally much shorter, but the size of the alphabet is five times larger. A 
proper coarse-graining of the 20 amino acids into fewer clusters for different conformations is important for
improving the signal-to-noise ratio when extracting information by statistical means. 

It is our purpose to propose a simple scheme to study amino acid similarity from amino acid string 
statistics. Information about the environment for an amino acid at a certain conformation state may be 
provided by statistics of residue strings or windows centered at the amino acid. The success of window-based 
approaches such as GOR [2] for secondary structure prediction validates the use of such statistics. We shall 
derive a measure for the difference of amino acid pairs based on the distance of probability distributions, 
and investigate how the difference is dependent on conformations. 

2 Amino acid distances 

Our discussion relies heavily on the distance between two probability distributions. A well-defined
measure of this distance is the Kullback-Leibler (KL) distance or relative entropy [7, 8, 9], which, for two
distributions $\{p_i\}$ and $\{q_i\}$, is given by

$$d(\{p_i\},\{q_i\}) = \sum_i p_i \log(p_i/q_i). \tag{1}$$

It corresponds to a likelihood ratio and, if $p_i$ is expanded around $q_i$, its leading term is the $\chi^2$ distance:

$$d_\chi(\{p_i\},\{q_i\}) = \sum_i (p_i - q_i)^2 / p_i. \tag{2}$$
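The expansion mentioned above can be checked in a few lines. The following is a sketch, writing $q_i = p_i - \delta_i$ with $\sum_i \delta_i = 0$ (the symbol $\delta_i$ is introduced here for illustration only):

```latex
\begin{aligned}
d(\{p_i\},\{q_i\})
  &= -\sum_i p_i \log\!\left(1 - \frac{\delta_i}{p_i}\right)
   = \sum_i p_i \left[\frac{\delta_i}{p_i} + \frac{\delta_i^2}{2 p_i^2} + O(\delta_i^3)\right] \\
  &= \sum_i \delta_i + \frac{1}{2}\sum_i \frac{(p_i - q_i)^2}{p_i} + O(\delta^3)
   = \frac{1}{2}\sum_i \frac{(p_i - q_i)^2}{p_i} + O(\delta^3).
\end{aligned}
```

To this order the KL distance therefore agrees with the $\chi^2$ distance up to a constant factor of 1/2.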

The following symmetrized form of the KL distance is often used:

$$D(\{p_i\},\{q_i\}) = \tfrac{1}{2}\,[\,d(\{p_i\},\{q_i\}) + d(\{q_i\},\{p_i\})\,]. \tag{3}$$
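The three quantities in Eqs. (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `kl`, `chi2` and `sym_kl` and the toy distributions are assumptions, and the handling of zero probabilities is a convention chosen here.

```python
import math

def kl(p, q):
    """Relative entropy d(p, q) = sum_i p_i * log(p_i / q_i), as in Eq. (1)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention assumed here: 0 * log(0 / q) = 0
        if qi == 0.0:
            raise ValueError("q_i = 0 where p_i > 0: the distance is infinite")
        total += pi * math.log(pi / qi)
    return total

def chi2(p, q):
    """Chi-square distance sum_i (p_i - q_i)^2 / p_i, as in Eq. (2); needs p_i > 0."""
    return sum((pi - qi) ** 2 / pi for pi, qi in zip(p, q))

def sym_kl(p, q):
    """Symmetrized distance D(p, q) = [d(p, q) + d(q, p)] / 2, as in Eq. (3)."""
    return 0.5 * (kl(p, q) + kl(q, p))

# Toy distributions over a 3-letter alphabet (made-up numbers).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p), sym_kl(p, q), chi2(p, q))
```

For nearby distributions such as these, `kl(p, q)` is close to `0.5 * chi2(p, q)`, while `sym_kl` removes the dependence on the order of the arguments.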

The distributions to be considered here come from window statistics. For a given amino acid residue
$a_i = x$ at the conformation state $\alpha$ in a sequence $a_1 a_2 \cdots a_i \cdots$, we take the string
$a_{i-n} a_{i-n+1} \cdots a_i \cdots a_{i+n}$ of width $(2n + 1)$ as a window. Denote by $N_k(y|x,\alpha)$ the count of
residue $y$ at the $k$-th site from the center of such windows. As in GOR, only the conformation of the central
residue is considered. A quantity derived from $N_k(y|x,\alpha)$ is

$$N(x,\alpha) = \sum_y N_k(y|x,\alpha), \tag{4}$$

which, as the total count of residue $x$ at the conformation $\alpha$, is independent of $k$. The conditional probability
distribution $P_k(y|x,\alpha)$ is estimated as

$$P_k(y|x,\alpha) = N_k(y|x,\alpha) / N(x,\alpha). \tag{5}$$

The weight matrix $M_{20 \times 2n}$ with entries $P_k(y|x,\alpha)$ is the so-called residue profile of $x$ at $\alpha$. Such
profiles are used in window-based approaches, e.g. GOR and artificial neural network algorithms [12].
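The construction of a residue profile from window counts can be sketched as follows. This is an illustrative sketch only: the function names, the key layout `(x, a, k, y)`, the optional `pseudocount` and the restriction to centres with a full window inside the sequence are assumptions made here, not details given in the text.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_counts(sequences, n):
    """Accumulate N_k(y | x, a) for offsets k = -n..n (k != 0).

    `sequences` is an iterable of (residues, states) string pairs of equal
    length. Only centres whose full window lies inside the sequence are used
    (an assumption), so that N(x, a) is the same for every k.
    """
    counts = defaultdict(int)   # key: (x, a, k, y)
    totals = defaultdict(int)   # key: (x, a), i.e. N(x, a)
    for residues, states in sequences:
        for i in range(n, len(residues) - n):
            x, a = residues[i], states[i]
            totals[(x, a)] += 1
            for k in range(-n, n + 1):
                if k != 0:
                    counts[(x, a, k, residues[i + k])] += 1
    return counts, totals

def profile(counts, totals, x, a, n, pseudocount=0.0):
    """Return {k: [P_k(y | x, a) for y in AMINO_ACIDS]} for k = -n..n, k != 0."""
    denom = totals[(x, a)] + pseudocount * len(AMINO_ACIDS)
    return {
        k: [(counts[(x, a, k, y)] + pseudocount) / denom for y in AMINO_ACIDS]
        for k in range(-n, n + 1) if k != 0
    }

# Toy input: one short chain, all residues labelled helix (made-up data).
toy = [("ACAGACA", "hhhhhhh")]
counts, totals = window_counts(toy, n=1)
prof = profile(counts, totals, "A", "h", n=1)
print(totals[("A", "h")], prof[-1][AMINO_ACIDS.index("C")])
```

A positive `pseudocount` avoids zero probabilities, which would make the KL distance infinite; whether and how the original analysis regularizes the counts is not stated.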

We expect that, on average, the correlation between the central residue and an outer site decays as
they become farther apart in the sequence. To examine the correlation, we consider a large window width of 21, i.e.
$n = 10$, and take the 'noise' background to be the following average:

$$Q(y|x,\alpha) = \frac{1}{6}\left[\sum_{k=-10}^{-8} P_k(y|x,\alpha) + \sum_{k=8}^{10} P_k(y|x,\alpha)\right]. \tag{6}$$

The KL distance $D_{k;x,\alpha}(\{P_k(y|x,\alpha)\},\{Q(y|x,\alpha)\})$ provides a measure of the correlation between the central
site and site $k$. As we shall see, for our purpose of amino acid comparison a narrow window of width 7, within
which the correlation is strong, is used to describe the amino acid environment.
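A sketch of the noise background and the per-site distances is given below. It assumes a profile stored as a dictionary `{k: distribution}` for k = -10..10 (k != 0), takes the border sites to be |k| = 8, 9, 10, and uses the symmetrized distance; all names and the toy numbers are illustrative.

```python
import math

def sym_kl(p, q):
    """Symmetrized KL distance; assumes strictly positive probabilities."""
    return 0.5 * sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

def noise_background(prof, border=(8, 9, 10)):
    """Average the profile over the border sites k = -10..-8 and 8..10."""
    sites = [-k for k in border] + list(border)
    size = len(prof[sites[0]])
    return [sum(prof[k][j] for k in sites) / len(sites) for j in range(size)]

def site_distances(prof):
    """Distance of each site's distribution from the noise background."""
    q = noise_background(prof)
    return {k: sym_kl(p, q) for k, p in sorted(prof.items())}

# Toy 2-letter profile: sites near the centre are biased, the rest uniform.
toy = {k: ([0.8, 0.2] if abs(k) <= 3 else [0.5, 0.5])
       for k in range(-10, 11) if k != 0}
dist = site_distances(toy)
print(dist[1], dist[5], dist[9])
```

In this toy case only the six sites nearest the centre have a non-zero distance from the background, which is the kind of decay the narrow window is meant to capture.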

Using the distributions $P_k(y|x,\alpha)$ from window statistics to characterize amino acid residues, we define the
distance of the residue pair $x$ and $y$ at the same conformation $\alpha$ as the following sum of KL distances:

$$D_{xy;\alpha} = \sum_{k=\pm 1,\pm 2,\pm 3} D(\{P_k(z|x,\alpha)\},\{P_k(z|y,\alpha)\}). \tag{7}$$

Similarly, to explore the difference of the same residue $x$ at different conformations $\alpha$ and $\beta$, we may define
the distance

$$D_{\alpha\beta;x} = \sum_{k=\pm 1,\pm 2,\pm 3} D(\{P_k(z|x,\alpha)\},\{P_k(z|x,\beta)\}). \tag{8}$$
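The sum over the six sites nearest the centre can be sketched as below. The profile layout `{k: distribution}` and the toy values are assumptions for illustration; the same function covers both definitions, depending on whether the two profiles belong to two residues at one conformation or to one residue at two conformations.

```python
import math

SITES = (-3, -2, -1, 1, 2, 3)

def sym_kl(p, q):
    """Symmetrized KL distance; assumes strictly positive probabilities."""
    return 0.5 * sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

def profile_distance(prof_a, prof_b, sites=SITES):
    """Sum of symmetrized KL distances over the sites k = +-1, +-2, +-3."""
    return sum(sym_kl(prof_a[k], prof_b[k]) for k in sites)

# Toy 2-letter profiles (made-up numbers), identical at every site.
prof_x = {k: [0.7, 0.3] for k in SITES}
prof_y = {k: [0.6, 0.4] for k in SITES}
prof_z = {k: [0.2, 0.8] for k in SITES}
print(profile_distance(prof_x, prof_y), profile_distance(prof_x, prof_z))
```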

By means of the residue pair distances we can further study the classification of amino acids. With the
KL distance, we may define the cluster distance in a way consistent with that for residue pairs. For example,
we characterize the cluster consisting of residues $x$ and $y$ by the 'coarse-grained' probability

$$P_k(z|x \cup y,\alpha) = \frac{N_k(z|x,\alpha) + N_k(z|y,\alpha)}{N(x,\alpha) + N(y,\alpha)}. \tag{9}$$

We may then define the distance between this cluster and other residues or clusters. With the cluster
distance defined, cluster analysis can be used to reduce the amino acid alphabet.
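One way to realize this idea is a greedy bottom-up merge in which a cluster is represented by its pooled counts. The sketch below is an assumption-laden illustration: the data layout `{name: {k: counts}}`, the function names, the tie-breaking by sorted order and the toy numbers are all hypothetical, and strictly positive counts are assumed.

```python
import math
from itertools import combinations

SITES = (-3, -2, -1, 1, 2, 3)

def sym_kl(p, q):
    """Symmetrized KL distance; assumes strictly positive probabilities."""
    return 0.5 * sum((a - b) * math.log(a / b) for a, b in zip(p, q))

def to_profile(counts):
    """Normalize per-site count vectors {k: [N_k(z)]} into {k: [P_k(z)]}."""
    return {k: [c / sum(v) for c in v] for k, v in counts.items()}

def pool(cx, cy):
    """Pooled counts of a merged cluster; normalizing them gives the
    coarse-grained probability of the cluster."""
    return {k: [a + b for a, b in zip(cx[k], cy[k])] for k in cx}

def distance(cx, cy):
    px, py = to_profile(cx), to_profile(cy)
    return sum(sym_kl(px[k], py[k]) for k in SITES)

def agglomerate(counts):
    """Repeatedly merge the two nearest clusters; return the cluster lists."""
    clusters = dict(counts)
    history = []
    while len(clusters) > 1:
        x, y = min(combinations(sorted(clusters), 2),
                   key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]))
        clusters[x + y] = pool(clusters.pop(x), clusters.pop(y))
        history.append(sorted(clusters))
    return history

# Toy counts over a 2-letter alphabet (made-up numbers).
toy = {
    "I": {k: [80, 20] for k in SITES},
    "L": {k: [150, 50] for k in SITES},
    "K": {k: [30, 70] for k in SITES},
}
history = agglomerate(toy)
print(history)
```

Because the pooled cluster is built from counts rather than from averaged probabilities, abundant residues weigh more in the merged profile.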
3 Results



Our analysis is performed on a data set taken from the database PDB_SELECT [3, 4] of nonredundant protein
sequences with known structures. The sequences share less than 25% amino acid identity. We keep only
the non-membrane sequences with lengths between 80 and 420. The secondary structure assignment
is taken from the DSSP database [5]. As in GOR, we use the following reduction of the 8 DSSP states to 4
states, helix (h), sheet (e), coil (c) and turn (t): H, G, I → h; E → e; X, S, B → c; and T → t. The counts of
each amino acid at the four reduced conformation states are given in Table 1.
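The state reduction and the length filter described above can be written as a small lookup. This is a sketch: the names `DSSP_TO_4`, `reduce_dssp` and `keep_length`, the fallback to coil for unlisted symbols and the inclusive length bounds are assumptions made here.

```python
# Reduction of DSSP states to four conformations, as stated in the text:
# H, G, I -> h; E -> e; X, S, B -> c; T -> t.
DSSP_TO_4 = {
    "H": "h", "G": "h", "I": "h",
    "E": "e",
    "X": "c", "S": "c", "B": "c",
    "T": "t",
}

def reduce_dssp(states, default="c"):
    """Map a DSSP state string to the 4-state alphabet.

    Symbols not listed above fall back to `default`; this fallback is an
    assumption, not something stated in the text.
    """
    return "".join(DSSP_TO_4.get(s, default) for s in states)

def keep_length(length, min_len=80, max_len=420):
    """Length filter; treating the bounds as inclusive is an assumption."""
    return min_len <= length <= max_len

print(reduce_dssp("HGIEXSBT"))
```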

We first estimate the probability distributions of residues for each central residue at a given conformation.
At this step, the window width is 21. We then calculate the distances $D_{k;x,\alpha}(\{P_k(y|x,\alpha)\},\{Q(y|x,\alpha)\})$ of these
distributions to their corresponding noise distributions. The results are shown in Figs. 1 to 4, each of which
is for one conformation of the central residue. The 20 curves in each figure correspond to the 20 central amino
acids. Because the sample sizes differ, the curves are not directly comparable. (Roughly speaking, under the
null hypothesis of identical distributions the $\chi^2$ distance should be scaled with the sample size, so a small
sample size would give a relatively large distance.) However, a decay is clearly seen as the site $k$ moves
farther from the center. For more discussion of correlations we refer the reader to [10, 11]. As seen from
most curves in the figures, the distances at the 6 sites nearest to the center are significantly larger than those at
the window border sites. We shall therefore use a window width of 7 for further comparison of amino acids.

It is natural to expect that similar residues have similar window statistics. Thus, the KL distance
between two residue profiles provides a measure of their similarity, i.e. a small KL distance implies a large
similarity. We calculate the KL distance matrices $D_{xy;\alpha}$ for residue pairs at different conformations with
formula (7). The results are given in Tables 2 and 3, where entries have been multiplied by a factor of 200. With
the distributions (9) defined for clusters, we further perform the simplest bottom-up approach to hierarchical
clustering of residues, starting from 20 clusters of single residues and then joining the two nearest clusters
step by step until a single cluster is obtained. The results of clustering are given in Tables 4 to 7. Since
the dendritic trees returned from clustering are less informative, for visualization we introduce graphs whose
vertices are the 20 amino acids, with an edge between a pair of amino acids if and only if their distance
is below some preset threshold. Graphs obtained from the distance matrices are shown in Figs. 5 to 8, where
vertices with no connecting edges are omitted.
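The threshold graphs can be sketched as follows. The distance values below are made up for illustration and the function names are hypothetical; an edge is drawn when the distance is strictly below the threshold, and vertices without edges simply never enter the graph.

```python
def threshold_graph(dist, threshold):
    """Adjacency sets with an edge x-y iff dist[(x, y)] < threshold."""
    graph = {}
    for (x, y), d in dist.items():
        if d < threshold:
            graph.setdefault(x, set()).add(y)
            graph.setdefault(y, set()).add(x)
    return graph  # isolated vertices are absent by construction

def components(graph):
    """Connected components of the graph, each as a sorted list."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(graph[v] - comp)
        seen |= comp
        result.append(sorted(comp))
    return result

# Illustrative scaled distances for a few residue pairs (made-up values).
toy = {("I", "L"): 4, ("I", "V"): 6, ("L", "V"): 6,
       ("S", "T"): 5, ("C", "S"): 21, ("C", "T"): 25}
print(components(threshold_graph(toy, threshold=20)))
```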

In sequence pair alignment we often do not have structure information for both sequences. With the
structure information ignored, we have the mixed counts

$$N_k(y|x) = \sum_\alpha N_k(y|x,\alpha), \tag{10}$$

from which we calculate the residue pair distances averaged over conformations. The distance matrix obtained
is given in Table 8. We have also calculated the distances (8) to compare different conformations. Distances
between any two conformations for various residues are listed in Table 9.
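Pooling the counts over conformations amounts to dropping the conformation label and summing. A minimal sketch, assuming counts stored under keys `(x, a, k, y)` (a hypothetical layout) and made-up numbers:

```python
from collections import defaultdict

def mix_over_conformations(counts):
    """Sum N_k(y | x, a) over conformations a to obtain N_k(y | x)."""
    mixed = defaultdict(int)
    for (x, a, k, y), c in counts.items():
        mixed[(x, k, y)] += c
    return dict(mixed)

# Toy counts: residue A as centre, neighbour at offset +1 (made-up numbers).
toy = {
    ("A", "h", 1, "L"): 3,
    ("A", "e", 1, "L"): 2,
    ("A", "h", 1, "K"): 4,
}
mixed = mix_over_conformations(toy)
print(mixed)
```

The mixed counts can then be normalized and compared exactly as the conformation-specific ones.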

4 Discussions 

Figures 1 to 4 illustrate the dependence of outer sites in a window on the center. Although in the KL
distance we sum up effects on individual residues from the center, we can still see the tendency that the
center is generally more strongly correlated with the C-terminal sites than with the N-terminal sites. Furthermore,
we may divide the 20 amino acids into two groups, with M, I, L, V, F, Y and W in one and the remainder in
the other. They roughly correspond to hydrophobic and hydrophilic groups. It is seen that for the coil and 
turn conformations a hydrophobic center exhibits a stronger correlation with outer sites than a hydrophilic 
center, while for the sheet conformation a hydrophilic center exhibits a stronger correlation. 

It is interesting to compare the distance matrices obtained here with the commonly
used BLOSUM62 similarity score matrix. A small distance corresponds to a large similarity score. There is much
evidence of consistency between the distances and the scores. For example, the residue pairs VI, IL, VL
and ST have positive BLOSUM scores and at the same time small distances. The graphs in Figs. 5 to 8
contain two connected subgraphs: one consists of I, L, V, F, Y, and the other consists of S, T. This is
further evidence of the consistency. Generally, the averaged distance matrix is closer to BLOSUM62 than
the conformation-specific ones. However, there are some remarkable differences. For example, the residue
pairs GT, QA and FV, which have negative scores, have rather small distances in one or another of the helix,
sheet and coil conformations, while the pairs YH and NH, which have positive scores, have rather large
distances in the helix conformation. Moreover, YH has a large distance in all four conformations.

BLOSUM matrices are derived from conserved amino acid patterns called blocks. It is expected that for
most score entries we should see consistency in at least one conformation-specific distance matrix. For a
given residue pair, if the residue profiles of an amino acid center are very dissimilar for different conformations,
the pair distance would generally become smaller after averaging over conformations. In this case, BLOSUM
scores and conformation-specific distances need not be consistent, since the former contain no structure
information.

Our results show a strong dependence of residue behavior on conformation. For example, the
distances of the pairs CD and SI in helix are about twice as large as in sheet. Many residue pairs
display a strong dependence of distance on conformation. Table 9 views the conformation dependence
through a comparison of conformation pairs. Indeed, the table indicates that for any conformation pair there are
certain residues which behave very differently in the two conformations. However, generally speaking, coil
and turn are quite similar.

In comparisons of physicochemical properties of amino acids, the abundance of amino acids is not taken
into consideration. This is also the case for the distances defined above. Other statistical variables that include
the effect of sample size may be introduced. One candidate is the $\chi^2$ statistic for identical distributions.
The analysis using this new statistic is under study.

We expect that algorithms using multiple conformation-specific matrices should work better in sequence
alignment. The popular Needleman-Wunsch algorithm can be modified to include a putative conformation for
each residue. This will be discussed elsewhere.

This work was supported in part by the Special Funds for Major National Basic Research 
Projects and the National Natural Science Foundation of China. 
Table 1. Sample sizes of each amino acid residue in different protein secondary structures.

Table 2. Amino acid distance matrices for helices (bottom-left) and turns (top-right). Entries have been
multiplied by a factor of 200.

Table 3. Amino acid distance matrices for sheets (bottom-left) and coils (top-right). Entries have been
multiplied by a factor of 200.

Table 4. Clustering of amino acid alphabets for helices. The first column indicates the number of amino
acid groups.

Table 5. Clustering of amino acid alphabets for sheets. The first column indicates the number of amino
acid groups.

Table 6. Clustering of amino acid alphabets for coils. The first column indicates the number of amino
acid groups.

...  ...  I  Y  M  H  P  W  C
18   AEKQRSTNGD  F LV  I  Y  M  H  P  W  C
17   AEKQRSTNGD  FLV  I  Y  M  H  P  W  C
16   AEKQRSTNGD  FLVI  Y  M  H  P  W  C
15   A  EK  Q  R  ST  N  G  D  FLVI  Y  M  H  P  W  C
14   A  EKQ  R  ST  N  G  D  FLVI  Y  M  H  P  W  C
13   A  EKQR  ST  N  G  D  FLVI  Y  M  H  P  W  C
12   A  EKQRST  N  G  D  FLVI  Y  M  H  P  W  C
11   AEKQRST  N  G  D  FLVI  Y  M  H  P  W  C
10   AEKQRSTN  G  D  FLVI  Y  M  H  P  W  C
 9   AEKQRSTNG  D  FLVI  Y  M  H  P  W  C
 8   AEKQRSTNG  D  FLVIY  M  H  P  W  C
 7   AEKQRSTNGD  FLVIY  M  H  P  W  C
 6   AEKQRSTNGDFLVIY  M  H  P  W  C
 5   AEKQRSTNGDFLVIYM  H  P  W  C
 4   AEKQRSTNGDFLVIYMH  P  W  C
 3   AEKQRSTNGDFLVIYMHP  W  C
 2   AEKQRSTNGDFLVIYMHPW  C

(The first row is truncated in the source. In rows 16 to 18 the group boundaries within AEKQRSTNGD and
within "F LV" are not recoverable from the source.)

Table 7. Clustering of amino acid alphabets for turns. The first column indicates the number of amino
acid groups.

18   A  DN  E  K  ST  R  Q  L  Y  F  V  H  G  I  P  M  W  C
17   A  DN  EK  ST  R  Q  L  Y  F  V  H  G  I  P  M  W  C
16   A  DNEK  ST  R  Q  L  Y  F  V  H  G  I  P  M  W  C
15   A  DNEKST  R  Q  L  Y  F  V  H  G  I  P  M  W  C
14   A  DNEKSTR  Q  L  Y  F  V  H  G  I  P  M  W  C
13   ADNEKSTR  Q  L  Y  F  V  H  G  I  P  M  W  C
12   ADNEKSTRQ  L  Y  F  V  H  G  I  P  M  W  C
11   ADNEKSTRQL  Y  F  V  H  G  I  P  M  W  C
10   ADNEKSTRQLY  F  V  H  G  I  P  M  W  C
 9   ADNEKSTRQLYF  V  H  G  I  P  M  W  C
 8   ADNEKSTRQLYFV  H  G  I  P  M  W  C
 7   ADNEKSTRQLYFVH  G  I  P  M  W  C
 6   ADNEKSTRQLYFVHG  I  P  M  W  C
 5   ADNEKSTRQLYFVHGI  P  M  W  C
 4   ADNEKSTRQLYFVHGIP  M  W  C
 3   ADNEKSTRQLYFVHGIPM  W  C
 2   ADNEKSTRQLYFVHGIPMW  C

(The rows for 20 and 19 groups are illegible in the source.)



Table 8. Amino acid distances ignoring conformation.

      C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L
S    21
T    25   5
P    25   9  11
A    29  12  12  16
G    21   8  11  11  11
N    25   7   9  13  12   8
D    32   9   9  15  10  11   6
E    40  18  18  21  11  18  14   9
Q    34  12  12  18   8  14  10   9   8
H    21  13  14  17  18  14  12  15  23  17
R    31  11  13  16   7  13  11  10   9   5  15
K    35  15  14  18  12  16  10   9   8  10  22   8
M    33  19  16  20  10  17  18  18  19  16  24  15  18
I    25  16  13  16  12  14  16  17  20  18  19  16  15  10
L    26  16  14  17   9  14  16  17  19  15  20  14  15   8   4
V    24  10   9  13   8   9  11  12  15  13  17  12  12  10   6   6
F    22  13  11  16  13  11  14  16  20  18  18  16  15  12   6   6
Y    24   9   9  13  13  10  11  14  19  15  14  15  14  13   8   9
W    32  20  19  20  21  17  22  25  29  23  24  24  27  18  14  13

(Columns V, F and Y are missing from the source. The row labels S, T and P are inferred from the column order.)

Figure 1: KL distances (doubled) of outer sites from their corresponding noise background. Each curve is for
an amino acid at the center labeled 0, whose conformation is turn. For clarity, the curves for M, I, L, V, F, Y
and W have been shifted up by multiplying by an extra factor of 100.



Table 9. Conformation pair distances for each amino acid. Entries have been multiplied by a factor of 200.
(h: helix, e: sheet, c: coil, t: turn.)

      he   hc   ht   ec   et   ct
C    133  185  163  127  197  139
S     93  129  124   93  148   73
T     98  120  131  103  175   96
P    172  118  121   89  233  116
A    112  148  127  122  149   73
G     79  101   80   91  107   57
N    126  145  118  106  152   76
D    149  137  149   93  174   81
E    159  152  138  109  192   73
Q     --  157  133   93  143   93
H    100  150  110  117  152   98
R    131  146  128   91  144   85
K     --  149  128   93  155   88
M    130  161  147  126  156  135
I    138  180  134  118  130  110
L    143  162  113  127  148   98
V    114  151  151   98  147  101
F    120  150  111  107  115   88
Y     95  147   96  111  117   80
W    120  181  201  123  173  111

(-- : entry missing from the source.)

Figure 2: KL distances (doubled) of outer sites from their corresponding noise background. Each curve is
for an amino acid at the center labeled 0, whose conformation is coil. For clarity, the curves for M, I, L, V, F, Y
and W have been shifted up by multiplying by an extra factor of 100.

Figure 3: KL distances (doubled) of outer sites from their corresponding noise background. Each curve is for
an amino acid at the center labeled 0, whose conformation is sheet. For clarity, the curves for M, I, L, V, F, Y
and W have been shifted up by multiplying by an extra factor of 100.

Figure 4: KL distances (doubled) of outer sites from their corresponding noise background. Each curve is for
an amino acid at the center labeled 0, whose conformation is helix. For clarity, the curves for M, I, L, V, F, Y
and W have been shifted up by multiplying by an extra factor of 100.




Figure 5: Connecting graph of amino acids in helix. Edges exist only between vertices with a scaled distance
not greater than 20. Vertices without any connecting edges are not shown. (Edge legend: D < 10; 10 < D < 15;
15 < D < 20.)

Figure 6: Connecting graph of amino acids in sheet. Edges exist only between vertices with a scaled distance
not greater than 20. Vertices without any connecting edges are not shown. (Edge legend: D < 10; 10 < D < 15;
15 < D < 20.)

Figure 7: Connecting graph of amino acids in coil. Edges exist only between vertices with a scaled distance
not greater than 17. Vertices without any connecting edges are not shown. (Edge legend: D < 10; 10 < D < 15;
15 < D < 17.)

Figure 8: Connecting graph of amino acids in turn. Edges exist only between vertices with a scaled distance
not greater than 35. Vertices without any connecting edges are not shown. (Edge legend: D < 25; 25 < D < 30;
30 < D < 35.)